Neural Networks and the Bayesian Extension
Bayesian Analysis - University of Utah
Introduction
The purpose of this write up is to summarize my study and exploration of Neural Networks (NN) and how principles of Bayesian analysis can be used and applied in this framework. The primary resource for this project was Probabilistic Deep Learning with Python, Keras, and TensorFlow Probability (Dürr, Sick, and Murina (2020)), though additional resources were used in understanding the material. Additionally, I will be using this write up to clarify and expound on information touched on in my final presentation for the class. Mainly, I hope to dive deeper into:
- loss functions and how they work in the NN framework.
- Gradient adaptation and weights for training data.
- Flexible models and Normalizing Flow (NF)
- Variation inference (VI) and MC dropout architectures.
This write up will be structured to follow along with my final presentation and fill in the gaps that my slides and initial explanations left.
Part 1 - The Basics of Deep Learning and the Likelihood Approach
In order to understand deep learning, we have to understand the basics of the models that we are attempting to fit and to use. A Deep Learning (DL) system is an extension of the NN framework. It typically involves more hidden layers than a traditional NN. The development of GPU’s for video games allowed for faster computing times and increased development of NN exponentially. These types of models attempt to mimic the function of a human brain. They do this by working through layers of information and have gotten really good a solving classification problems.
The above image illustrates how a DL model works. You have variables that you have determined to be important or at least easy to observe, the “input layer”, that are placed into the model. This input layer is then passed to a series of “hidden layers” that work to understand the input and gain information from the layer before it, before finally outputting the information that was learned to the output layer. One of the benefits of a DL model is that it can offer some degree of uncertainty to its predictions. This is most evident when the models are being used for classification where it can return probabilities of how likely a certain classification is, i.e. is the image submitted a picture of a dog, or of a cat.
Creating a DL system usually takes two steps, with many little hidden steps in each:
- Choosing a NN architecture
- Tuning the weights of the model using training data
- This is where concepts of the loss function and backpropegation come in.
- It can be the most confusing part.
In thinking of how to best explain these two steps, I will first explain some of the different NN architectures, explain how the loss function and backpropegation work, and then finally use the example of creating a model that predicts whether or not to give someone a loan based on their current income and credit score would hypothetically work.
NN architectures
A neural network can be compared to a regression model. A simple regression model can be written as:
\[ Y = \beta_i X_i + \varepsilon_i \]
where\(Y\)is the outcome,\(\beta_i\)represents the regression parameters or slopes,\(X_i\)is the data, and\(\varepsilon_i\)represents unmeasured noise. The comparable NN to this model looks like:
where we now have our output represented by\(p_1\)is comparable to\(Y\)in our regression model,\(x_i\)represents the observed data just like our regression model, the arrows would represent our regression parameters, and “bias” is most comparable to the intercept or could even be thought of as the error term. It is probably best to think of it as some parameter that is unaffected by the data itself. The output exists in what is called a “neuron”. The neuron takes the data, and outputs something.
There were three types of networks that were really discussed by Dürr, Sick, and Murina (2020). They are:
- Fully connected neural networks (fcNN)
- Convolutional neural networks (CNN)
- One-dimensional CNN
fcNN
The fcNN architecture connects each layer to every neuron in the next. In the picture below, we can see that each neuron in the input layer is connected to every neuron in the hidden layer. And every neuron in the hidden layer is connected to each neuron in the output layer. Each of these neurons takes information from the previous layer, and then applies weights to it. In the below picture, our two inputs would get passed to the hidden layer where each neuron has a different weight and would create two new inputs, often denoted as\(z_i\)but that I will denote as\(z_{ih}\), where\(i\)represents the number of neurons in the layer and\(h\)represents the hidden layer that it was calculated on, which are then passed onto the next layers. When they are passed to the output layer, the\(z_{ih}\)are converted to the final outputs, which I will call\(z_j\)where the\(j\)represents the number of classes of interest. In the fcNN we are looking at, the\(z_{ih}\)for each of the\(i\)neurons are passed to the output layer, converted to a new set of\(z_j\). In a classification problem, the softmax equation (\(p_i = e^{z_i}/\sum_j e^{z_j}\)) takes the final output,\(z_j\), and converts it to a probability that each class is the correct classification. A detailed example of this principle will be shown later. As you can imagine, however, this type of model can get pretty large and complutationally intensive pretty quickly. Other models attempt to overcome this.
CNN
The CNN model was developed to help reduce the computational intensity of the fcNN. Instead of connecting all of the neurons to all of the neurons, it only connects some neurons to other neurons. The best way to think of it, I think, is to imagine a picture (shown below) of a dog. A fcNN would look at the full picture, which works great for our brains, but not so great for our computers. A would look at different parts of the picture and put the information together in the output layer.
One-dimensional CNN
A one-dimensional CNN behaves similarly to a CNN, but it assumes that the order of the data is important. For example, this model could be a large language model, which is used to predict the next word in this . The words leading up the are important in predicting what the word is, as changing the word can change the probability of a certain word being correct. It can be used with time data as well.
Other networks
There are additional network architectures, but they tend to follow the same basic principles and were not discussed by Dürr, Sick, and Murina (2020).
Tuning the weights and training the data
We can think of NN’s as an extension of regression models being used for prediction. As with other simpler models, we first give the NN a training data set with all the right answers so that it can chose the optimal weights for each of the hidden layers. How does the model know which way to adjust the weights? That depends on the loss function that you chose. The loss function can change based on the data that you are working with, but a good example is the MSE.
\[ loss = MSE= \frac 1 n \sum_{i=1}^n (y_i - \hat y_i)^2 = \frac 1 n \sum_{i=1}^n (y_i - (a \cdot x_i +b))^2 \]
In this case, we will focus on the third part of the equation, with\(a\)and\(b\)in it. Our goal is to find the values of\(a\)and\(b\)that minimize this loss. This isn’t too hard of a problem with simple models. Here we would just solve:
\[ \begin{split} \frac{\partial loss}{\partial a} = 0 \\ \frac{\partial loss}{\partial b} = 0 \end{split} \]
But, this isn’t always a simple problem to solve, and the with NN, often there isn’t a simple closed form solution to this problem. Because of this, another method was developed to minimize loss when working with NN. The most common method is called the gradient descent method, which tests different points within the model until it finds the minimized loss. This can also get a little tricky, but usually works pretty well. When using this method to solve for one parameter, Dürr, Sick, and Murina (2020) describe it as a blind mathematician walking slowly into a valley until he finds the lowest point. In the case of the loss function previously written and attempting to minimize for\(a\), we can use the following update rule:
\[ a_{t+1} = a_t - \eta \cdot grad_a (loss) \]
where\(a_{t+1}\)is the updated parameter,\(a_t\)is the last parameter we’d calculated under our rule,\(grad_a\)is the gradiant of the loss function with respect to\(a\), and\(\eta\)is our learning rate and controls how fast we jump around trying to find the minimized loss.
We can then extend this idea to estimating both parameters\(a\)and\(b\). This would result in an update formula:
\[ \begin{split} a_{t+1} & = a_t - \eta \cdot grad_a \\ b_{t+1} & = b_t - \eta \cdot grad_b \end{split} \]
When we treat\(b\)as fixed, we can find the gradient as:
\[ grad_a = \frac{\partial}{\partial a} \frac 1 n \sum_{i=1}^2 (a \cdot x_i + b - y_i)^2 = \frac 2 n \sum_{i=1}^n (a \cdot x_i + b - y_i) \cdot x_i \]
We can do the same by keeping\(a\)fixed. If we’re doing that, we can find the gradient for\(b\)to be:
\[ grad_b = \frac{\partial}{\partial b} \frac 1 n \sum_{i=1}^2 (a \cdot x_i + b - y_i)^2 = \frac 2 n \sum_{i=1}^n (a \cdot x_i + b - y_i) \]
We plug these values into the gradient rule functions we explained earlier, and viola, we start to find the minimum values for our lose functions.
It should be noted here that our choice of\(\eta\)is important. If we choose something too big, there’s a chance we never find the optimal values of\(a\)and\(b\), but with an\(\eta\)that is to small, we need many more update steps. I would argue that it is better to have a too small learning parameter than a too large one, though this would obviously require more data. (Additionally, Dürr, Sick, and Murina (2020) state that if the loss starts to get too big, a good rule of thumb is to divide your current learning rate by 10 and see what happens.)
What we have done so far is pretty commonly done in simple linear regression. There are some additional tricks we can apply.
Mini-batch gradient descent or Stochastic Gradient Descent (SGD)
This process involves randomly selecting data points from the data, which is sometimes called a mini-batch, to get an unbiased estimator of the data. Often, GPUs have limited memory so this helps resolve that problem. It can happen that we don’t find the global minimum, but for some reason that is not completely understood, this procedure works pretty well whe fitting DL models.
SGD variants that speed up the learning
There are two prominent algorithms, RMSProb and Adam, that are included in most DL framework. The methods aren’t all that different from SGD, but often speed up the process. Dürr, Sick, and Murina (2020) doesn’t delve into the differences between all these methods, but Goh (2017) provides a good overview of SGD and other methods previously mentioned. I still haven’t fully understood the difference between RMSProb and Adam, but after skimming the article by Goh (2017), I think it has to do with the addition of a diagonal matrix,\(\Gamma_i^k\), which allows different step sizes for different directions. Our update formulas would look like this (first in the notation of Goh (2017) and then an attempt to translate it to our notation):
\[ w^{k+1} = w^0 + \sum_i^k \Gamma_i^k \nabla f(w^i) \Rightarrow a_{t+1} = a_t - \sum_i^k \Gamma_i^k grad_a \]
If I am understanding this correctly,\(\Gamma_i^k\)takes the place of\(\eta\)and we changes somehow based of the gradient we are looking at. I want to look more into this, but have started to run out of time.
Automatic differentiation
Basically, we use the chain rule,\(\left( f(g(x)) \right)' = f'(g(x)) \cdot g'(x)\), to update our parameter values with the update rule. This is were the Backpropagation, or reverse-mode differentiation, comes into play. Here, we basically start with the final loss value and move from our outer layer to our input layer to get our gradient with respect to each of our model parameters.
An example, with the help of ChatGPT
After my presentation, I was thinking through the role of the loss function, backpropogation, the gradient, etc… and I was having a hard time understanding the concept. Once I got better at thinking it through, I talked to ChatGPT and it helped create a simple scenario to illustrate how the loss function and a basic NN would work.
Let’s set up the following scenario. Let’s pretend to have set up a basic NN with the purpose of deciding whether or not some one gets a lone. We have two inputs (\(x_1 = 4.0\),\(x_2 = 0.82\)) which represent income and credit score (divided by 100). Our hidden layer has 4 neurons that use a Rectified Linear Unit (ReLU) that was suggest by ChatGPT. ReLU translates input values so that if the input is positive, it stays the same, but if the input is zero or negative it becomes 0.
\[ ReLU(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \]
Finally, we have 3 neurons on our output layer: approved, Needs Review, and Denied. We assume this is a fcNN
Our NN looks like this:
Our hypothetical person should have been denied. Right now, our hidden layers have the following weights and bias:
| Neuron | \(w_1\) | \(w_2\) | \(b\) |
|---|---|---|---|
| \(h_1\) | 0.5 | 0.3 | -0.4 |
| \(h_2\) | -0.2 | 0.8 | 0.1 |
| \(h_3\) | 0.1 | -0.5 | 0.2 |
| \(h_4\) | 0.7 | 0.6 | -0.3 |
We pass\(x_1\)and\(x_2\)to the hidden layers and get our\(z\).
\[ \begin{split} h_1: \; z = 0.5(4.0)+0.3(0.82)−0.4 & =2.0+0.246−0.4=1.846 \\ ReLU(1.846) & = 1.846 \\ h_2: \; z = −0.2(4.0)+0.8(0.82)+0.1 & =−0.8+0.656+0.1=−0.044 \\ ReLU(−0.044) & = 0 \\ h_3: \; z =0.1(4.0)−0.5(0.82)+0.2 & =0.4−0.41+0.2=0.19 \\ ReLU(0.19) & =0.19 \\ h_4: \; z =0.7(4.0)+0.6(0.82)−0.3 & =2.8+0.492−0.3=2.992 \\ ReLU(2.992) & =2.992 \end{split} \]
Our output from the hidden layer is:
\[ z_h = [1.846, 0, 0.19, 2.992] \]
We pass this to the Output Layer, which will have four weights, one for each part of\(z_h\).
| Output Neuron (Class) | \(w_1\) | \(w_2\) | \(w_3\) | \(w_4\) | Bias \(b\) |
|---|---|---|---|---|---|
| Approved | 0.6 | -0.1 | 0.3 | 0.8 | 0.2 |
| Needs Review | -0.3 | 0.5 | 0.1 | 0.4 | -0.1 |
| Denied | 0.2 | -0.4 | -0.5 | 0.1 | 0.0 |
In much the same way as with the hidden layer, we can calculate our raw output as logit to get our final output \(z\) as:
\[ z = [3.7582, 0.562, 0.5734] \]
We exponentiate \(z\) and apply the softmax formula to calculate the probabilities.
\[ e^z = [42.89, 1.754, 1.774] \]
We note that the \(sum(e^z) = 46.418\) and get the following probabilities:
\[ p_{loan} = \left[\frac{42.89}{46.418}=0.924, \frac{1.754}{46.418}=0.038, \frac{1.774}{46.418}=0.038 \right] \]
which represent the probabilities associated with the new subject being approved, needing further review, or being denied.
Our model as is predicts that this new subject should be approved with 92.4% confidence. That’s some pretty big confidence just to be wrong.
Not to worry though, we are still training the data. We will use the categorical cross-entropy (suggested by ChatGPT) to calculate loss.
\[ loss = -\sum_i y_i \log(\hat y_i) \]
We take the probability of the correct class (being denied):
\[ -\log(0.038) = 3.27 \]
We can now calculate the gradient of the loss with respect to each output logit:
\[ \begin{split} \frac{\partial loss}{\partial z} & = \hat y_i - y_i \\ & = [0.924 - 0, 0.038 - 0, 0.038 - 1] \\ & = [0.924, 0.038, -0.96s] \end{split} \]
We pick the learning rate \(\eta = 0.1\). With this, we go from the output layer to our hidden layer (backpropigation in action) using a variation of the formula that we have previously defined (and treating \(x_i\) as a vector) we get a whole new set of weights:
\[ w_{approved} = w_{ij} - \eta \frac{\partial loss}{\partial z} \cdot h_i \]
When we use this formula, we get a list of weights to be used for \(z_h\) passed from the hidden layers. This gives us the new weights with the bias remaining the same:
| Output Neuron (Class) | \(w_1\) | \(w_2\) | \(w_3\) | \(w_4\) | Bias \(b\) |
|---|---|---|---|---|---|
| Approved | 0.429 | -0.1 | 0.282 | 0.524 | 0.108 |
| Needs Review | -0.3 | 0.5 | 0.1 | 0.4 | -0.1 |
| Denied | 0.378 | -0.4 | -0.4817 | 0.388 | 0.0962 |
If we rerun this without changing anything with the hidden layer, we would get:
\[ z = [2.52, 0.562, 1.8627] \]
And after using the softmax formula:
\[ p_{loan} = [0.602, 0.085, 0.313] \]
Which still favors approving the individual, but has greatly decreased the probability of them being approved for a loan. If we were to move forward with this training, the predictions would get better and better.
| Class | Before | After | Change |
|---|---|---|---|
| Approved | 92.4% | 60.2% | Decrease |
| Needs Review | 3.8% | 8.5% | Increase |
| Denied | 3.8% | 31.3% | Increase |
It should be noted that this isn’t a perfect example, but should help in understanding a simplified version of the algorithm that is being applied.
The Likelihood Approach
One of the questions that remains is how do we find the loss function? Traditional NN and DL models us the maximum likelihood approach. The loss function that is chosen can vary and depends on the type of question we are trying to answer. Let’s use the example of a six-sided die with a $-sign replacing some of the sides. Let’s say we want to use an NN that predicts the number of dollar signs on the dice if we know that we got exactly two dollar signs in exactly ten throws.
We would take the equation:
\[ P(\text{training}) = \Pi_{\text{j with }y=0} p_0(x_j) \Pi_{\text{j with }y = 1} p_1(x_j) \]
and tune the weights of our model to maximize this equation. You might recognize this as a classic classification problem. Our next step is take the log of the function:
\[ \log(P( \text{training} )) = \sum_{\text{j with }y=0} \log p_0(x_j) + \sum_{\text{j with }y = 1} \log p_1(x_j) \]
Finally, we need to make the jump from our maximum log likelihood to our crossentropy formula. We do this by making the log-likelihood scale independent (by multiplying by \(1/n\)) and then getting the minimum value. This leaves us with a crossentropy equation of:
\[ \text{crossentropy} = - \frac 1 n \left( \sum_{\text{j with }y=0} \log p_0(x_j) + \sum_{\text{j with }y = 1} \log p_1(x_j) \right) \]
This equation works for two classification classes, but it can be extended to work for a multinomial distribution. :
\[ P(Y = k|x,w) = \left\{\begin{aligned} p_0(x,w) \; for \; k=0 \\ p_1(x,w) \; for \; k=1 \\ \vdots \\ p_9(x,w) \; for \; k=9 \end{aligned} \right. with \; \sum_{i=0}^9 p_i(x,w) = 1 \]
Often in classification problems, a variation of this is used called the Kullback-Leibler Divergence:
\[ \text{crossentropy} = - \frac 1 n \sum_{i=1}^n p_i^{\text{true}} \cdot \log(p_i) \]
where \(p_i^{\text{true}}\) is 1 for the true class of the training example and 0 for the other components. (To see this in action, see the loan example.)
Flexible Probability Distributions and Normalizng Flows
We are now getting to the section of the book that started to mess with my brain the most. In this section we will discuss to additional and important extensions for NN:
- Flexible probability distributions
- Normalizing flows (NF)
Flexible probability distributions
Real world data can start to get kind of messy. Because of this, mixture distributions have been used to make models more flexible. Often times this models are autoregressive, since they assume the the distribution can be estimated using previous information.
Take for example WaveNet by Google. Additional information about WaveNet can be found on Google’s text-to-speech website. To summarize this model very briefly, Google wanted to generate realistic sounding voices for text-to-speech. They used a specialized version of CNN called a dilated convolution. Say we have some raw audio, for each time \(t\) we discretize the sound data. (Dürr, Sick, and Murina (2020) say that 16-bit is used for this, but I’m not sure what this means yet.) This means that at time \(t\), our audio signal \(x_t\) takes on a discrete value from \(0\) to \(2^{16} - 1 = 65,535\). In WaveNet, they then assumed that \(x_t\) depends on the audio signal samples earlier in time, making this a one-dimensional CNN. This means that:
\[ P(x_t) = P(x_t|x_{t-1}, x_{t-2}, ... , x_0) \]
so we can sample values of \(x_t\) from previous values. This is called an autoregressive model, meaning that our input depends on the previous observations and inputs. The WaveNet people essentially added a start sequence of audio values to the trained WaveNet and then predicted \(P(x_t)\) based on the start values provided.
Another example of a flexible autoregressive model is PixelCNN, which was created by OpenAI and applies the principles already discussed, but to a picture instead. This means that we predict a pixel color based on the past pixel colors we have observed.
These models created mixture distributions, specifically discretized logistic mixture models. They took a logistic distribution, gave it limits that it could work in (i.e. \(0\) to \(2^16 - 1\)) and quantized it.
Then, the mixed a couple of distributions together, as seen below:
By creating a mixture of simple base distributions, we have created a flexible way to model much more complex distributions. This works great in low-dimensional spaces, but what can we do if we want to work in higher-dimensional space?
Normalizing flows (NFs)
What is an NF? Honestly, I still don’t get it 100%, but I’m starting to get there. “IN a nutshell, an NF learns a transformation (flow) from a simple high-dimensional distribution to a complex one.” (Dürr, Sick, and Murina (2020)) Which sound really impressive, but still isn’t the most clear. The goal of an NF is to fit a complex distribution without picking a distribution in advanced or setting up a mixutre of several distributions. These are used in facial image generation, and we can draw or pick a new facial image from this complex distribution. How does this work? Basically, an NF using Jacobian transformations to get from a simple distribution, \(p_z(z)\) to the complex distribution, \(p_x(x)\). This can be direct say by going from \(z \rightarrow x\) or more complex by going from \(z \rightarrow y \rightarrow ... \rightarrow x\). The paper “Density Estimation using Real NVP” by Dinh, Sohl-Dickstein, and Bengio (2016) provides a great introduction to these types of models and some additional extensions that have been made. Though, I won’t lie, I haven’t yet had a chance to read and understand this article.
Flow models are pretty good at playing with images. For example, see the images below:
They basically take a face, and within similar looking faces, can change the image little by little and extrapolate information based on these models. The normalizing flow takes the simple input and in the more complex x-distribution, models that would need to change and outputs the x distribution.
There’s a lot of cool stuff going on in the traditional DL models.
Part 2 - The Benefits of Bayesian
But, the traditional DL models are not perfect. One of the bigger failings in traditional DL models is that they are not great in extrapolating when data is sparse. For example, Dürr, Sick, and Murina (2020) talk about image classification. Say we want to develop a NN that can identify animals. We can train a model that gets really good at distinguishing between different animals.
In the above image, there was a model called VGG16 that became really good at image classification. It would be able to tell the difference between these two animals, and even got to the point that it could classify the type of dog (Affenpinscher) and the type of elephant (Tusker). The question now is, can the model identify the elephant in the room? The answer was no. Why? The training images had no pictures of elephants in rooms and it would try to extrapolate the information, but it had no way to measure uncertainty for an image that was unlike anything it was trained on.
The Bayesian method offers a chance to a account for the uncertainty in areas when we have less data by leveraging prior distributions.
The Coin Toss Example
Let’s say we are making an NN that predicts the next value of a coin. We assume that the coin is fair, but we start using our flips as the training data. We flip the coin twice and the outcome is heads both times.
Using the traditional maximum likelihood approach, based on the data we have, the NN would predict that the third toss would be heads as well. We can look at this a little in the graphs below:
Since all of our data says that the coin only lands on head, our maximum likelihood based model also predicts with 100% certainty that the third coin toss will result in heads.
When we use the Bayesian approach, however, we can use a prior distribution to get an added estimate of our uncertainty. In this same case, we assume the coin to be fair (having the same probability of landing heads or tails) and use the appropriate prior predictive distribution. Let’s look at the same graphs under this approach and see how things change:
We can see in the posterior predictive distribution that our model would also predict the coin would land on heads, but unlike the previous approach, there is some uncertainty that is involved. The BNN would predict the next toss to be heads with an 80% chance.
As we get more and more data, our model becomes less and less dependent on the prior information except in areas where the data may be sparse.
Bayesian Neural Networks (BNN) Architecture
One of the main differences between BNN and the traditional NN is that the weights of the different parts of the model are replaced by a distribution, but how can we approximate these distributions in a timely manner that won’t make our computer blow up?
There are two different BNN methods that were discussed in Dürr, Sick, and Murina (2020):
- Variation inference (VI)
- Monte Carlo dropout (MC dropout)
We will discuss briefly how these two methods work and how they differ from each other.
Variation inference (VI)
The main idea behind VI is that we approximate the weight distributions with something called a variational distribution. The default in most DL software (like Tensor Flow Probability(TFP)) is the Gaussian variational distribution. These means that the model has to learn two weight distributions, the mean \(w_\mu\) and variance \(w_\sigma\). For the prior, the standard normal \(N(0,1)\) is commonly used. As more information is gathered, the weights for the variational distributions are updated. This process minimizes the difference between our posterior distribution and the true distribution of the weights that we need. Often, this is done using the Kullback-Leibler distribution.
Monte Carlo dropout (MC droupout)
This is the other method used to create BNN. It should be noted that BNNs have twice as many parameters since we are estimating a mean and a standard deviation. This can be kind of taxing on a computer, so we randomly chose neurons to be set to zero. This was demonstrated in 2014 to prevent overfitting by Srivastava et al. (2014) in traditional NNs. This idea was later applied to BNNs by Gal and Ghahramani (2016), and it worked pretty well. We can get an idea of how this works by looking at the image below:
MC dropout is widely used in training data. This isn’t the only way to use it, however, as we can also use it in testing datasets.
How does this work? Gal and Ghahramani (2016) used a framework similar to VI. (They had an additional regulation parameter, but that is commonly removed in practice, according to Dürr, Sick, and Murina (2020)). The main difference between the MC dropout and VI methods is the addition of the tuning parameter \(p^*\), which represents the proportion of neurons to be set to \(0\) in the training and testing stages of the model creation. If the model isn’t behaving how you want it to, you can simply adjust \(p^*\) and try again.
One final thing I would like to not about BNN is that we can get a predictive distribution by averaging over the weight distributions of the node. This distribuiton looks like:
\[ p(y| x_{test}, D) = \sum_i p(y|x_{text}, w_i) \cdot p(w_i|D) \]
where \(x_{test}\) represents the test inputs, \(w_i\) represents the weight parameters at the different neurons, and \(D\) represents the data that you used to train the model. By using the same test inputs, \(x_{test}\), many times over, \(T\), and passing the input through the dropout model, we can combine our dropout predictions to get the Bayesian predictive distribution:
\[ p(y|x_{test}, D) = \frac 1 T \sum_{t=1}^T p(y|x_{test}, w_t) \]
This final predictive distribution captures our uncertainty. Dürr, Sick, and Murina (2020) describes this uncertainty as “epistemic and … aleatoric”, which means it covers our lack of knowledge and chance imbalances.
Conclusion
We have explored over the course of this paper the structure of NNs and how they have continued to improve. Overall, NNs have been getting better and better at predicting and classifying different objects and will continue to do so. The benefit of traditional NNs can only take us so far, and they often fail to report uncertainty in areas where training data is sparse or even non-existent. The Bayesian application to NN (BNN) can overcome this and help provide uncertainty when extrapolating predictions. As we continue to explore novel methods, our ability to predict and see these new methods will only continue to improve.